1 Client

  • Susan G. Komen
  • A breast cancer organization in the United States.
  • Komen focuses on patient navigation and advocacy, providing resources for breast cancer patients to understand the American medical system. It has also funded research into the causes and treatment of breast cancer.
  • LinkedIn
  • Website

2 Recommendation

The analysis highlights the role of tumor size in breast cancer prognosis: tumor size varies markedly across N and T stages, which underlines the importance of early detection. For Susan G. Komen, this supports strengthening screening and awareness campaigns. The association between larger tumors and a higher number of positive regional nodes, although weak in this dataset, also supports advocating for personalized treatment and comprehensive nodal assessment. By building on the hypothesis testing results and the relationship between tumor size and stage, Susan G. Komen can strengthen its role in promoting breast cancer detection, treatment, and research, with the aim of improving patient outcomes and patient care.

3 Evidence

3.1 Initial Data Analysis (IDA)

  • This dataset of breast cancer patients was obtained from the November 2017 update of the SEER Program of the NCI. The dataset was published by Reyhaneh Namdari.
  • For more information about this dataset, see the links below.
  • Source
  • Additional information

3.2 Identifying the Stage and Conditions for Peak Tumor Size

3.2.1 Figure 1: Tumor size comparison by N stage

From the box plots, the N3 stage has the largest median tumor size, which aligns with the expectation that more advanced lymph node involvement is associated with a larger primary tumor. The presence of outliers in all stages also highlights the variability in tumor size within each N stage.
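A comparison like Figure 1 can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the data frame `demo` is synthetic, and the column name `N.Stage` is an assumption (only `Tumor.Size` appears in the report's output).

```r
# Synthetic stand-in for the SEER data (values are made up for illustration).
set.seed(42)
demo <- data.frame(
  N.Stage    = rep(c("N1", "N2", "N3"), times = c(300, 100, 60)),  # assumed column name
  Tumor.Size = c(rgamma(300, shape = 2, scale = 13),
                 rgamma(100, shape = 2, scale = 16),
                 rgamma(60,  shape = 2, scale = 20))
)

# Median tumor size within each N stage (the thick line in each box).
stage_medians <- tapply(demo$Tumor.Size, demo$N.Stage, median)

# Points more than 1.5 * IQR beyond the box are flagged as outliers.
outlier_counts <- tapply(demo$Tumor.Size, demo$N.Stage,
                         function(x) length(boxplot.stats(x)$out))

boxplot(Tumor.Size ~ N.Stage, data = demo,
        xlab = "N stage", ylab = "Tumor size (mm)")

print(stage_medians)
print(outlier_counts)
```

Reading the medians and outlier counts alongside the plot makes the two claims in the paragraph (largest median, outliers in every stage) directly checkable.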

3.2.2 Figure 2: Tumor size comparison by T stage

From the violin plots, the T3 stage has the highest median tumor size. This suggests that patients in the T3 stage tend to have larger tumors than patients in the other stages.

3.2.3 Figure 3: Relationship between Regional Node Positivity and Tumor Size

## [1] 0.24
##                   (Intercept) my_data$Reginol.Node.Positive 
##                      26.30875                       1.00165

\[Correlation = 0.24 \text{ (weak positive correlation)}\]

\[y = 1.002x + 26.309 \text{ (where x = Regional Node positivity and y = Tumor size (mm))}\]

In summary, there is a weak positive correlation in the data: under the fitted line, each additional positive regional node is associated with an increase of approximately 1.002 mm in tumor size.
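To make the fitted line concrete, the reported coefficients can be plugged in directly. The sketch below uses only the printed values (intercept 26.30875, slope 1.00165, correlation 0.24); the helper `predict_tumor_size` is a hypothetical name introduced here. For a simple linear regression the squared correlation equals the proportion of variance explained, so a correlation of 0.24 corresponds to roughly 6% of the variation in tumor size.

```r
# Coefficients and correlation as printed in the report's output.
intercept <- 26.30875
slope     <- 1.00165
r         <- 0.24

# Hypothetical helper: predicted tumor size (mm) for a number of positive nodes.
predict_tumor_size <- function(positive_nodes) {
  intercept + slope * positive_nodes
}

predicted <- predict_tumor_size(c(0, 5, 10))
r_squared <- r^2   # share of variance explained in simple linear regression

print(round(predicted, 2))   # 26.31 31.32 36.33
print(r_squared)             # 0.0576
```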

3.2.4 Figure 4: Residual graph of tumor size ~ Regional node positivity

However, the residual plot shows that the residuals are not homoscedastic, so the linear model is not a good fit. This may reflect the weak correlation.
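A residual check of this kind can be sketched as follows. The data are synthetic and deliberately generated with noise that grows with the predictor, so the example shows what a heteroscedastic pattern looks like rather than reproducing Figure 4.

```r
# Synthetic data whose noise grows with x (made up to illustrate the pattern).
set.seed(1)
x <- rep(1:40, each = 25)
y <- 26 + 1 * x + rnorm(length(x), sd = 2 + 0.5 * x)

fit <- lm(y ~ x)

# Residuals vs fitted values: a funnel shape signals heteroscedasticity.
plot(fitted(fit), resid(fit),
     xlab = "Fitted tumor size (mm)", ylab = "Residual (mm)")
abline(h = 0, lty = 2)

# Rough numeric check: compare residual spread in the lower and upper halves.
lower    <- fitted(fit) <= median(fitted(fit))
sd_lower <- sd(resid(fit)[lower])
sd_upper <- sd(resid(fit)[!lower])
print(c(lower = sd_lower, upper = sd_upper))
```

If the residual spread is clearly different between the two halves, the constant-variance assumption of the linear model is doubtful.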

3.3 Hypothesis testing (Two Sample T Test at the 5% significance level)

Looking at the box plots generated, there is a clear trend that patients with distant cancer have larger tumors. To investigate whether this difference could be due to chance, we use a two-sample t-test at the 5% significance level.

Set up research question

Does A stage impact the tumor size of patients?

H: Hypothesis \(H_{0}\) vs \(H_{1}\)

Let \(\mu_{1}\) be the mean tumor size of patients with Regional A Stage (A1_data in the output below).

Let \(\mu_{2}\) be the mean tumor size of patients with Distant A Stage (A2_data in the output below).

Thus, \[ H_{0} \text{: There is no difference between } \mu_{1} \text{ and } \mu_{2} \; (\mu_{1} = \mu_{2})\]

\[ H_{1} \text{: There is a difference between } \mu_{1} \text{ and } \mu_{2} \; (\mu_{1} \neq \mu_{2})\]

T: Test statistic and P: P-value

## 
##  Two Sample t-test
## 
## data:  A1_data$Tumor.Size and A2_data$Tumor.Size
## t = -7.9175, df = 4022, p-value = 3.109e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -21.83660 -13.16857
## sample estimates:
## mean of x mean of y 
##  30.07350  47.57609

Conclusion

  • T (Test Statistic): The observed test statistic is \(t = -7.9175\).

  • P (P-value): The p-value is \(3.109 \times 10^{-15}\).

  • Statistical conclusion: As the p-value < 0.05, we reject the null hypothesis.

  • Scientific conclusion: The data suggest that A Stage has an effect on tumor size.
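As a consistency check, the printed p-value and confidence interval can be recovered from the test statistic, the degrees of freedom, and the two sample means. The sketch below uses only the values shown in the t-test output above; because those values are rounded, the results agree only up to rounding.

```r
# Values copied from the printed t-test output.
mean_x <- 30.07350   # A1_data
mean_y <- 47.57609   # A2_data
t_stat <- -7.9175
df     <- 4022

diff_means <- mean_x - mean_y          # difference in sample means
std_error  <- diff_means / t_stat      # implied standard error of the difference

# Two-sided p-value and 95% confidence interval from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)
ci_95   <- diff_means + c(-1, 1) * qt(0.975, df) * std_error

print(diff_means)
print(p_value)
print(ci_95)
```

The interval excludes zero, which is the same information the p-value conveys: a difference of zero between the two means is not compatible with the data at the 5% level.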

4 Integration of External Evidence

Carter, Allen, and Henson (1989), Relation of tumor size, lymph node status, and survival in 24,740 breast cancer cases.

Carter, Allen, and Henson (1989), Accuracy of the extent of axillary nodal positivity related to primary tumor size, number of involved nodes, and number of nodes examined.

Koscielny et al. (2009), Impact of tumour size on axillary involvement and distant dissemination in breast cancer.

5 Acknowledgments

Sievert (2020), Interactive web-based data visualization with R, plotly, and shiny.

Yihui Xie (2023), R Markdown: The Definitive Guide.

Qiu (2021), Creating Pretty Documents from R Markdown.

Exemplar 1

Exemplar 2

Exemplar 3

6 References

Carter, Christine L, Carol Allen, and Donald E Henson. 1989. “Relation of Tumor Size, Lymph Node Status, and Survival in 24,740 Breast Cancer Cases.” Cancer 63 (1): 181–87.
Koscielny, S, R Arriagada, J Adolfsson, T Fornander, and J Bergh. 2009. “Impact of Tumour Size on Axillary Involvement and Distant Dissemination in Breast Cancer.” British Journal of Cancer 101 (6): 902–7.
Qiu, Yixuan. 2021. “Creating Pretty Documents from R Markdown: The Tactile Theme.” https://cran.r-project.org/web/packages/prettydoc/vignettes/tactile.html.
Sievert, Carson. 2020. “Interactive Web-Based Data Visualization with R, Plotly, and Shiny.” https://plotly-r.com.
Xie, Yihui, Garrett Grolemund, and J. J. Allaire. 2023. “R Markdown: The Definitive Guide.” https://bookdown.org/yihui/rmarkdown/.

7 Appendix

7.1 Client Choice

The report offers substantial value to Susan G. Komen:

Research Advancement: Insights on tumor size variations and regional node positivity provide a deeper understanding of breast cancer progression, potentially guiding future research. This aligns with Susan G. Komen’s research goals, making this report a valuable resource.

Awareness Campaigns: The report’s emphasis on early detection and the significance of tumor size in prognosis aligns with Susan G. Komen’s awareness initiatives. The findings could refine these campaigns, highlighting the need for early screening and public education on key indicators.

Treatment Advocacy: The data-driven insights inform advocacy for personalized treatments and comprehensive nodal assessments, supporting patient-centric healthcare policies.

Overall, the report aligns with Susan G. Komen’s mission and provides insights that can support its work against breast cancer.

7.2 Statistical Analyses

7.2.1 Linear modelling

Linear modelling was chosen to demonstrate the association between tumor size and regional node positivity. Assumption tests:

  • “Eye test”: the relationship does not appear linear (Figure 3). However, because the sample size is large (\(n = 4024\)), normality is assumed.

  • Residual plot (Figure 4): the residuals are not homoscedastic, so the linear model is not a good fit. This may reflect the weak correlation.

7.2.2 Hypothesis testing (Two Sample T Test at the 5% significance level)

7.2.2.1 Set up research question

Does A stage impact the tumor size of patients?

H: Hypothesis \(H_{0}\) vs \(H_{1}\)

Let \(\mu_{1}\) be the mean tumor size of patients with Regional A Stage (A1_data in the output below).

Let \(\mu_{2}\) be the mean tumor size of patients with Distant A Stage (A2_data in the output below).

Thus, \[ H_{0} \text{: There is no difference between } \mu_{1} \text{ and } \mu_{2} \; (\mu_{1} = \mu_{2})\]

\[ H_{1} \text{: There is a difference between } \mu_{1} \text{ and } \mu_{2} \; (\mu_{1} \neq \mu_{2})\]

7.2.2.2 Weigh up evidence

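A: Assumption
  • The two samples are assumed to be independent and large enough to represent the whole population.

  • The report assumes that the two populations have the same variance in tumor size. The F test below compares the two variances; its p-value of 6.77e-06 is evidence against equal variances, so this assumption should be treated with caution.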

## 
##  F test to compare two variances
## 
## data:  A2_data$Tumor.Size and A1_data$Tumor.Size
## F = 1.8305, num df = 91, denom df = 3931, p-value = 6.77e-06
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.391156 2.512025
## sample estimates:
## ratio of variances 
##           1.830517
  • The report assumes that the two populations have Normally distributed tumor size.

  • QQ plots (Figures 5 and 6): the values generally increase linearly, suggesting a normal distribution. Some deviation is observed at the extremities; however, because the sample sizes are large, the sample means can be treated as approximately normal by the central limit theorem.

Figure 5

Figure 6
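The central limit theorem argument can be illustrated with a small simulation. The data below are synthetic and right-skewed (not the SEER data), and the sample size of 92 is used because it matches the smaller group implied by the F test's degrees of freedom. The raw values depart from the QQ line in the tails, while the means of repeated samples lie much closer to it.

```r
set.seed(123)

# Synthetic right-skewed "tumor sizes" (made up for illustration).
raw <- rgamma(4000, shape = 2, scale = 15)

# Sampling distribution of the mean for samples of size 92.
sample_means <- replicate(2000, mean(sample(raw, size = 92, replace = TRUE)))

# Simple moment-based skewness; 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

op <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Raw values"); qqline(raw)
qqnorm(sample_means, main = "Means of samples (n = 92)"); qqline(sample_means)
par(op)

print(c(raw = skewness(raw), means = skewness(sample_means)))
```

This is why deviation at the extremities of Figures 5 and 6 does not by itself invalidate the t-test: the test relies on the distribution of the sample means, not of the individual tumor sizes.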

T: Test statistic and P: P-value
## 
##  Two Sample t-test
## 
## data:  A1_data$Tumor.Size and A2_data$Tumor.Size
## t = -7.9175, df = 4022, p-value = 3.109e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -21.83660 -13.16857
## sample estimates:
## mean of x mean of y 
##  30.07350  47.57609
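The F test reported above suggests that the two variances differ, so it is worth checking whether the conclusion depends on the equal-variance assumption. The sketch below is not part of the original analysis: it reconstructs approximate group variances from the printed (rounded) outputs and applies Welch's unequal-variance t-test. It assumes that A1_data has 3932 observations and A2_data has 92, as implied by the F test's degrees of freedom. With the raw data, the same check would be `t.test(A1_data$Tumor.Size, A2_data$Tumor.Size)`, since R's `t.test` uses `var.equal = FALSE` by default.

```r
# Values copied from the printed outputs (rounded, so results are approximate).
mean_1 <- 30.07350; n_1 <- 3932   # A1_data (group size assumed from F test df)
mean_2 <- 47.57609; n_2 <- 92     # A2_data (group size assumed from F test df)
t_pooled  <- -7.9175
var_ratio <- 1.830517             # var(A2_data) / var(A1_data)

diff_means <- mean_1 - mean_2

# Back out the pooled variance from the pooled t statistic.
se_pooled  <- diff_means / t_pooled
var_pooled <- se_pooled^2 / (1 / n_1 + 1 / n_2)

# Split the pooled variance into the two group variances using the F ratio.
var_1 <- var_pooled * (n_1 + n_2 - 2) / ((n_1 - 1) + (n_2 - 1) * var_ratio)
var_2 <- var_ratio * var_1

# Welch's t statistic with Welch-Satterthwaite degrees of freedom.
a <- var_1 / n_1
b <- var_2 / n_2
t_welch  <- diff_means / sqrt(a + b)
df_welch <- (a + b)^2 / (a^2 / (n_1 - 1) + b^2 / (n_2 - 1))
p_welch  <- 2 * pt(-abs(t_welch), df_welch)

print(c(t = t_welch, df = df_welch, p = p_welch))
```

Under these assumptions the Welch statistic is smaller in magnitude than the pooled one, but the p-value remains far below 0.05, so the statistical conclusion would be unchanged.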

7.2.2.3 Explain conclusion

Statistical conclusion:

As the p-value < 0.05, we reject the null hypothesis.

Scientific conclusion:

The data suggest that A Stage has an effect on tumor size.

7.3 Limitations

  • We assume that the two populations have Normally distributed tumor size; however, the QQ plots show some deviation at the extremities (Figures 5 and 6)

  • Weak correlation between Regional Node Positivity and Tumor Size (Figure 4)

  • Influence of Other Prognostic Factors

  • Variability in Measurement